What is Object Detection? Object detection is a computer vision task in which the goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of objects in an image, and classifying the objects into different categories. It forms a crucial part of vision recognition, alongside image classification and retrieval.
Papers and Code
Feb 04, 2025
Abstract:Collaborative Object Detection (COD) and collaborative perception can integrate data or features from various entities, and improve object detection accuracy compared with individual perception. However, adversarial attacks pose a potential threat to the deep learning COD models, and introduce high output uncertainty. With unknown attack models, it becomes even more challenging to improve COD resiliency and quantify the output uncertainty for highly dynamic perception scenes such as autonomous vehicles. In this study, we propose the Trusted Uncertainty Quantification in Collaborative Perception framework (TUQCP). TUQCP leverages both adversarial training and uncertainty quantification techniques to enhance the adversarial robustness of existing COD models. More specifically, TUQCP first adds perturbations to the shared information of randomly selected agents during object detection collaboration by adversarial training. TUQCP then alleviates the impacts of adversarial attacks by providing output uncertainty estimation through learning-based module and uncertainty calibration through conformal prediction. Our framework works for early and intermediate collaboration COD models and single-agent object detection models. We evaluate TUQCP on V2X-Sim, a comprehensive collaborative perception dataset for autonomous driving, and demonstrate a 80.41% improvement in object detection accuracy compared to the baselines under the same adversarial attacks. TUQCP demonstrates the importance of uncertainty quantification to COD under adversarial attacks.
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 04, 2025
Abstract:This study explores the challenges of integrating human visual cue-based dehazing into object detection, given the selective nature of human perception. While human vision adapts dynamically to environmental conditions, computational dehazing does not always enhance detection uniformly. We propose a multi-stage framework where a lightweight detector identifies regions of interest (RoIs), which are then enhanced via spatial attention-based dehazing before final detection by a heavier model. Though effective in foggy conditions, this approach unexpectedly degrades the performance on clear images. We analyze this phenomenon, investigate possible causes, and offer insights for designing hybrid pipelines that balance enhancement and detection. Our findings highlight the need for selective preprocessing and challenge assumptions about universal benefits from cascading transformations.
* arXiv admin note: substantial text overlap with arXiv:2410.01225
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 04, 2025
Abstract:We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.
* Videos of STRING-based robotics controllers can be found here:
https://sites.google.com/view/string-robotics
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 04, 2025
Abstract:Bird's-eye view (BEV) perception for autonomous driving has garnered significant attention in recent years, in part because BEV representation facilitates the fusion of multi-sensor data. This enables a variety of perception tasks including BEV segmentation, a concise view of the environment that can be used to plan a vehicle's trajectory. However, this representation is not fully supported by existing datasets, and creation of new datasets can be a time-consuming endeavor. To address this problem, in this paper we introduce SimBEV, an extensively configurable and scalable randomized synthetic data generation tool that incorporates information from multiple sources to capture accurate BEV ground truth data, supports a comprehensive array of sensors, and enables a variety of perception tasks including BEV segmentation and 3D object detection. We use SimBEV to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 04, 2025
Abstract:While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that can improve the model's memory efficiency and decrease memory time consumption by reducing the inefficient memory access operations. Our method features a memory-efficient adapter block that enables the common sharing of layer normalization between the self-attention and feed-forward network layers, thereby reducing the model's reliance on normalization operations. Within the proposed block, the cross-shaped self-attention is employed to reduce the model's frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that can enhance local inductive biases, particularly beneficial for the dense prediction tasks, e.g., object detection, instance segmentation, and semantic segmentation. The adapter block is finally formulated in a cascaded manner to compute diverse head features, thereby enriching the variety of feature representations. Empirically, extensive evaluations on multiple representative datasets validate that META substantially enhances the predicted quality, while achieving a new state-of-the-art accuracy-efficiency trade-off. Theoretically, we demonstrate that META exhibits superior generalization capability and stronger adaptability.
* This paper is accepted by ICLR 2025
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 04, 2025
Abstract:In this work, we present INTACT, a novel two-phase framework designed to enhance the robustness of deep neural networks (DNNs) against noisy LiDAR data in safety-critical perception tasks. INTACT combines meta-learning with adversarial curriculum training (ACT) to systematically address challenges posed by data corruption and sparsity in 3D point clouds. The meta-learning phase equips a teacher network with task-agnostic priors, enabling it to generate robust saliency maps that identify critical data regions. The ACT phase leverages these saliency maps to progressively expose a student network to increasingly complex noise patterns, ensuring targeted perturbation and improved noise resilience. INTACT's effectiveness is demonstrated through comprehensive evaluations on object detection, tracking, and classification benchmarks using diverse datasets, including KITTI, Argoverse, and ModelNet40. Results indicate that INTACT improves model robustness by up to 20% across all tasks, outperforming standard adversarial and curriculum training methods. This framework not only addresses the limitations of conventional training strategies but also offers a scalable and efficient solution for real-world deployment in resource-constrained safety-critical systems. INTACT's principled integration of meta-learning and adversarial training establishes a new paradigm for noise-tolerant 3D perception in safety-critical applications. INTACT improved KITTI Multiple Object Tracking Accuracy (MOTA) by 9.6% (64.1% -> 75.1%) and by 12.4% under Gaussian noise (52.5% -> 73.7%). Similarly, KITTI mean Average Precision (mAP) rose from 59.8% to 69.8% (50% point drop) and 49.3% to 70.9% (Gaussian noise), highlighting the framework's ability to enhance deep learning model resilience in safety-critical object tracking scenarios.
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 03, 2025
Abstract:Oriented Object Detection (OOD) has received increased attention in the past years, being a suitable solution for detecting elongated objects in remote sensing analysis. In particular, using regression loss functions based on Gaussian distributions has become attractive since they yield simple and differentiable terms. However, existing solutions are still based on regression heads that produce Oriented Bounding Boxes (OBBs), and the known problem of angular boundary discontinuity persists. In this work, we propose a regression head for OOD that directly produces Gaussian distributions based on the Cholesky matrix decomposition. The proposed head, named GauCho, theoretically mitigates the boundary discontinuity problem and is fully compatible with recent Gaussian-based regression loss functions. Furthermore, we advocate using Oriented Ellipses (OEs) to represent oriented objects, which relates to GauCho through a bijective function and alleviates the encoding ambiguity problem for circular objects. Our experimental results show that GauCho can be a viable alternative to the traditional OBB head, achieving results comparable to or better than state-of-the-art detectors for the challenging dataset DOTA
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 03, 2025
Abstract:Accurate and robust 3D object detection is essential for autonomous driving, where fusing data from sensors like LiDAR and camera enhances detection accuracy. However, sensor malfunctions such as corruption or disconnection can degrade performance, and existing fusion models often struggle to maintain reliability when one modality fails. To address this, we propose ReliFusion, a novel LiDAR-camera fusion framework operating in the bird's-eye view (BEV) space. ReliFusion integrates three key components: the Spatio-Temporal Feature Aggregation (STFA) module, which captures dependencies across frames to stabilize predictions over time; the Reliability module, which assigns confidence scores to quantify the dependability of each modality under challenging conditions; and the Confidence-Weighted Mutual Cross-Attention (CW-MCA) module, which dynamically balances information from LiDAR and camera modalities based on these confidence scores. Experiments on the nuScenes dataset show that ReliFusion significantly outperforms state-of-the-art methods, achieving superior robustness and accuracy in scenarios with limited LiDAR fields of view and severe sensor malfunctions.
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 04, 2025
Abstract:This paper introduces RoSe, the first-of-its-kind ML/Crypto codesign watermarking framework that regulates LLM-generated code to avoid intellectual property rights violations and inappropriate misuse in software development. High-quality watermarks adhering to the detectability-fidelity-robustness tri-objective are limited due to codes' low-entropy nature. Watermark verification, however, often needs to reveal the signature and requires re-encoding new ones for code reuse, which potentially compromising the system's usability. To overcome these challenges, RoSe obtains high-quality watermarks by training the watermark insertion and extraction modules end-to-end to ensure (i) unaltered watermarked code functionality and (ii) enhanced detectability and robustness leveraging pre-trained CodeT5 as the insertion backbone to enlarge the code syntactic and variable rename transformation search space. In the deployment, RoSe uses zero-knowledge proofs for secure verification without revealing the underlying signatures. Extensive evaluations demonstrated RoSe achieves high detection accuracy while preserving the code functionality. RoSe is also robust against attacks and provides efficient secure watermark verification.
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)
Feb 03, 2025
Abstract:While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods, and enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls from 2D cardiac echocardiographies. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.
Via
![arxiv icon](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Farxiv.41e50dc5.png&w=128&q=75)